Chapter 1

Executive Summary 



This chapter highlights some details for impatient readers; the rest
of the document provides considerably more detail.

1.1 Brief Introduction and Goals 



• MARF (http://marf.sf.net [Gro06]), which stands for Modular Audio
Recognition Framework, is an open-source project originally designed
for a pattern recognition course.

• MARF has several applications. Most revolve around its recognition
pipeline: sample loading, preprocessing, feature extraction, and
training/classification. One of the applications, for example, is the
Text-Independent Speaker Identification Application. The pipeline and
the application, as they stand, are purely sequential, with little or
no concurrency when processing a bulk of voice samples.

• The classical MARF pipeline is shown in Figure 2.1. The goal of this
work is to distribute the shown stages of the pipeline as services, as
well as stages that are not directly present in the figure (sample
loading, front-end application services such as a speaker
identification service, etc.), and to implement some disaster recovery
and replication techniques in the distributed system.

• Figure 2.2 presents the design of the distributed version of the
pipeline. It indicates different levels of basic front-ends, from
higher to lower, which client applications may invoke; services may
also invoke other services through their front-ends while executing in
the pipeline mode. The back-ends are in charge of providing the actual
servant implementations as well as features like primary-backup
replication, monitoring, and disaster recovery modules through
delegates.

1.2 Implemented Features So Far 

• As of this writing, the following are implemented. Most, but not all,
modules work:

• Out of the following six services: 

1. Speaker Ident Front-end Service (invokes MARF) 

2. MARF Pipeline Service (invokes the remaining four) 

3. Sample Loader Service 

4. Preprocessing Service 

5. Feature Extraction Service (may invoke Preprocessing for a
preprocessed sample)

6. Classification (may invoke Feature Extraction for features) 

all six work in the stand-alone and pipelined modes in CORBA, RMI,
and WS.

At demo time, the RMI (and, as a consequence, the Web Services)
implementations of the Sample Loader and Preprocessing stages were
not functional (the other nodes were, but could not work as a pipeline)
because of a design flaw in MARF itself: while the Sample class data
structure was itself Serializable, one of its members, which inherits
from a standard Java class, has non-serializable members in the parent,
causing marshalling/unmarshalling to fail. This was addressed only
after the demo.

• There are three clients, one for each communication technology type
(CORBA, WS, RMI).

• MARF-to-CORBA and MARF-to-RMI object adapters convert the serializable
objects understood by each technology to the MARF-native ones and back.
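
The marshalling failure mentioned above can be reproduced in miniature.
The sketch below is a simplified illustration of the general failure
mode, not the actual MARF Sample class: all class names (RawBuffer,
BrokenSample, FixedSample, SerializationCheck) are hypothetical, and
marking the offending field transient is only one possible repair.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper type that does NOT implement Serializable.
class RawBuffer {
    byte[] bytes = new byte[] {1, 2, 3};
}

// Serializable on paper, but it holds a non-serializable member.
class BrokenSample implements Serializable {
    private static final long serialVersionUID = 1L;
    RawBuffer buffer = new RawBuffer();
}

// One possible repair: skip the offending member and keep plain data.
class FixedSample implements Serializable {
    private static final long serialVersionUID = 1L;
    transient RawBuffer buffer = new RawBuffer();
    byte[] data = new byte[] {1, 2, 3};
}

class SerializationCheck {
    // Returns true if the object survives Java serialization.
    static boolean canMarshal(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false; // the failure RMI marshalling would surface
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A check of this kind, run against each data structure that crosses a
service boundary, would flag such a problem before deployment.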

1.3 Some Design Considerations 

• For WS there are no remote object references, so a class called
RemoteObjectReference was created, encapsulating nothing but a type
(int) and a URL (String) as a reference that can be passed around
modules, which can later use it to connect (using WSUtils).

• All communication modules rely on their delegates for business and
most of the transaction logic, thus remapping remote operations to
communication-technology-independent logic and enabling
cross-technology communication through message passing. There are two
types of delegates: basic and recoverable. The basic delegates merely
redirect the business logic and provide a basis for transaction logs
without actually implementing the transaction routines. They do not
incur the transaction overhead and simply allow testing of the
distributed business logic. The recoverable delegates extend the basic
ones with transactionality on top of the basic operations.

• All modules also have utility classes like ORBUtils, RMIUtils, and
WSUtils. These are used by the distributed modules for common
registration of services and their lookup. Due to the common design,
these can be looked up at run-time through reflection by loading the
requested module classes. The utility modules are also responsible for
loading the initial service location information from the
dmarf-hosts.properties file when available.
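
As an illustration of the WS reference holder described in this section,
here is a minimal sketch. Only the two fields (an int type and a String
URL) come from the text; the class name suffix, the type constants, the
accessors, and the validation are assumptions made for the example.

```java
import java.io.Serializable;

// Sketch of a WS "remote reference": a service type plus a URL.
// Constants and accessor names are hypothetical, not the DMARF API.
class RemoteObjectReferenceSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    static final int SAMPLE_LOADER = 1; // hypothetical type code
    static final int PREPROCESSING = 2; // hypothetical type code

    private final int type;
    private final String url;

    RemoteObjectReferenceSketch(int type, String url) {
        if (url == null || url.isEmpty()) {
            throw new IllegalArgumentException("url is required");
        }
        this.type = type;
        this.url = url;
    }

    int getType() { return type; }

    String getUrl() { return url; }

    @Override
    public String toString() {
        return "type=" + type + ", url=" + url;
    }
}
```

Because the holder carries only plain values, it can be passed between
modules in any of the three technologies and resolved to a live
connection at the point of use.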



1.4 Transactions, Recoverability, and WAL Design

• The Write-Ahead Log (WAL) consists of entries called "Transactions".
The idea is to write to the log first, ahead of committing anything;
once the write call (dump) returns, the transaction is committed.

• A Transaction is a data structure maintaining a transaction ID (long),
a filename of the object (not of the log, but where the object is
normally permanently stored, to distinguish different configurations),
the Serializable value itself (a Message, TrainingSet, or an entire
Serializable business-logic module), and timestamps.

• The WAL's maximum size is set, empirically, to 1000 entries before a
clean-up is needed. The advantage of keeping such entries is that they
enable future features such as point-in-time recovery (PITR), backup,
or replication.

• MARF-specific note: since MARF core operations are treated as a kind
of business-logic black box, the "transactions" are similar to
"before" and "after" snapshots of serialized data (maybe a design flaw
in MARF itself, to be determined).

• Checkpointing in the log is done periodically, by default every second.
A checkpoint is set to the ID of the latest committed transaction.
Thus, in the event of a crash, only committed transactions with an ID
greater than the checkpoint are recovered.
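
The write-first, commit-second, checkpoint-then-recover cycle described
in this section can be sketched as follows. This is an in-memory
illustration under stated assumptions, not the DMARF implementation:
the class and method names are hypothetical, a real log would dump each
entry to disk, and the 1000-entry clean-up is omitted.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical log entry: ID, target filename, value, timestamp.
class TransactionSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    final long id;
    final String filename;
    final Serializable value;
    final long timestamp = System.currentTimeMillis();
    boolean committed;

    TransactionSketch(long id, String filename, Serializable value) {
        this.id = id;
        this.filename = filename;
        this.value = value;
    }
}

// In-memory stand-in for the WAL.
class WriteAheadLogSketch {
    private final List<TransactionSketch> entries = new ArrayList<>();
    private long nextId = 1;
    private long checkpoint = 0; // latest committed ID at checkpoint time

    // Step 1: write to the log first; nothing is committed yet.
    synchronized long begin(String filename, Serializable value) {
        TransactionSketch t = new TransactionSketch(nextId++, filename, value);
        entries.add(t);
        return t.id;
    }

    // Step 2: commit only after the log write has returned.
    synchronized void commit(long id) {
        for (TransactionSketch t : entries) {
            if (t.id == id) t.committed = true;
        }
    }

    // Periodic step: remember the latest committed transaction ID.
    synchronized void checkpoint() {
        for (TransactionSketch t : entries) {
            if (t.committed && t.id > checkpoint) checkpoint = t.id;
        }
    }

    // Recovery: committed transactions newer than the checkpoint.
    synchronized List<Long> idsToRecover() {
        List<Long> ids = new ArrayList<>();
        for (TransactionSketch t : entries) {
            if (t.committed && t.id > checkpoint) ids.add(t.id);
        }
        return ids;
    }
}
```

In this sketch an entry that was logged but never committed is simply
ignored by recovery, which matches the rule that only committed
transactions past the checkpoint are replayed.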

1.5 Configuration and Deployment 



CORBA, RMI, and WS all use the dmarf-hosts.properties file at startup,
if available, to locate where the other services are and where to
register themselves.

Web Services have Tomcat context XML files for hosting as well as web.xml
and related WSDL XML files.

All such things are scripted in the GNU Make [SMSP00] Makefile and the
Ant [Con05] dmarf-build.xml makefiles.

1.6 Testing 

A Makefile target marf-client-test for a single wave file and a batch.sh
shell script test mostly the CORBA pipeline with 295 testing samples and
31 testing wave samples x 4 training configs x 16 testing configs.

The largest demo experiment involved only four machines in two different
buildings running the 6 services and a client (some machines ran more
than one service of each kind). Killing any single service in batch mode
and then restarting it recovered the ability of the pipeline to operate
normally.



1.7 Known Issues and Limitations 

• After long runs of all six CORBA services on the same machine, the
machine runs out of file (and socket) descriptors, reaching the default
kernel limits. (This is probably due to the large number of log files
opened and not closed while the containing JVM does not exit, which
accumulate over time after lots of rigorous testing.)

• Design flaws in the main MARF make the pipeline rigid and less
concurrent (five-layer nested transaction; see
startRecognitionPipeline() of MARFServerCORBA, MARFServerRMI, or
MARFServerWS for examples).

• Transaction ID "wrap-around" for long-running systems and transactions
with lots of message passing and other operations. MARF does a lot of
writes (dumps), and long-running servers have the potential to have
their transaction IDs recycled after an overflow. At the time of this
writing, there is no estimate of how long it might take for this to
happen.

• All services are single-threaded in the proof-of-concept
implementation, so concurrency is far from being fully exploited per
server instance. This is to be overcome in the near future.
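
For the transaction ID wrap-around issue above, a rough bound can be
worked out if one assumes the ID is a signed 64-bit long that starts at
zero and grows by one per transaction. The rate used below is purely an
assumption for illustration, and the helper names are hypothetical.

```java
class TransactionIdHorizon {
    // Years until a counter starting at 0 passes Long.MAX_VALUE
    // when it grows by idsPerSecond every second.
    static double yearsUntilOverflow(double idsPerSecond) {
        double secondsPerYear = 365.25 * 24 * 3600;
        return Long.MAX_VALUE / idsPerSecond / secondsPerYear;
    }

    // Increment that fails loudly instead of wrapping around silently.
    static long nextId(long current) {
        return Math.addExact(current, 1L); // ArithmeticException on overflow
    }
}
```

Under these assumptions, even one million IDs per second would take on
the order of 290,000 years to overflow, so the concern mainly applies if
IDs are narrower than 64 bits or advance in large jumps. By contrast, a
32-bit int counter at 1000 IDs per second would wrap in roughly 25 days.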

1.8 Partially Implemented Planned Features 

• WAL logging and recovery. 

• Message passing (for gossip, TCP or UDP + FIFO) is to be added to
the basic delegates.

• Application and Status Monitor GUI - the rudiments are there, but 
not fully integrated yet. 

1.9 NOT Implemented Planned Features 

• Primary-backup replication with a "warm standby".

• Lazy, gossip-based replication for Classification training sets.

• Two-phase commit for nested MARF Service transactions (covering
the entire pipeline run).

• Distributed system-aware NLP-related applications.

• Thin test clients and their GUI. 

1.10 Conclusion 

This proof-of-concept implementation of Distributed MARF has shown that
the pipeline stages (and not only they) can be executed in pipeline and
stand-alone modes on several computers. This can be useful in providing
any of the mentioned services to clients that have low computational
power, lack the environment required to run the whole pipeline locally,
or cannot afford long-running processing (e.g. collecting samples with a
laptop or any mobile device and submitting them to the server).
Additionally, some show-stopping design flaws were discovered in the
classical MARF itself that have to be corrected, primarily related to
storage and parameter passing among modules.



1.11 Future Work 

Address the design flaws, limitations, and not-implemented features and
release the code (for future improvements). You may volunteer to help
with these ;-) as well as with addressing the bugs and limitations when
there is time and desire. Please email mokhov@cse.concordia.ca if you
are interested in contributing to the Distributed MARF project.



Chapter 2 

Introduction 



Revision : 1.3 

This chapter briefly presents the purpose and the scope of the work
on the Distributed MARF project with a subset of relevant requirements,
definitions, and acronyms. All these aspects are detailed to some extent
later in the document. The application ideas come in small part from
[CDK05], [WW05], [Mic04], [Mic05b], [Mic06], [Gro06], and [Mok06b].



2.1 Requirements 

I have an open-source project, MARF (http://marf.sf.net [Gro06]), which
stands for Modular Audio Recognition Framework. Originally designed for
the pattern recognition course back in 2002, it received add-ons from
other courses I have taken, and I have maintained and released it
relatively regularly.

MARF has several applications. Most revolve around its recognition
pipeline: sample loading, preprocessing, feature extraction, and
training/classification. One of the applications, for example, is
Text-Independent Speaker Identification. The pipeline and the
application as they stand are purely sequential, with little or no
concurrency when processing a bulk of voice samples. Thus, the purpose
of this work is to make the pipeline distributed and run it on a
cluster or just a set of distinct computers to compare with the
traditional version, and to add disaster recovery, service replication,
communication technology independence, and so on.

Figure 2.1: The Core MARF Pipeline Data Flow



The classical MARF pipeline is shown in Figure 2.1. The goal of this work
is to distribute the shown stages of the pipeline as services, as well as
stages that are not directly present in the figure (sample loading,
front-end application services such as a speaker identification service,
etc.), and to implement some disaster recovery and replication techniques
in the distributed system.



Figure 2.2 presents the distributed version of the pipeline. It
indicates different levels of basic front-ends, from higher to lower,
which a client application may invoke; services may also invoke other
services through their front-ends while executing in pipeline mode. The
back-ends are in charge of providing the actual servant implementations
as well as features like primary-backup replication, monitoring, and
disaster recovery modules.

Figure 2.2: The Distributed MARF Pipeline

There are several distributed services; some are more general, and some
are more specific. The services can, and have to, intercommunicate. These
include:

• General MARF Service that exposes MARF's pipeline to clients and
other services and communicates with the services below.

• Sample Loading Service knows how to load certain file or stream types 
(e.g. WAVE) and convert them accordingly for further preprocessing. 

• Preprocessing Service accepts incoming voice or text samples and does 
the requested preprocessing (all sorts of filters, normalization, etc.). 

• Feature Extraction Service accepts data, presumably preprocessed, and
attempts to extract features out of it using the requested algorithm
(out of those currently implemented, like FFT, LPC, MinMax, etc.) and
may optionally query the preprocessed data from the Preprocessing
Service.

• Classification and Training Service accepts feature vectors and either
updates its database of training sets or performs classification against
existing training sets. It may optionally query the Feature Extraction
Service for the features.

• Natural Language Processing Service accepts natural language texts
and also performs some statistical NLP operations, such as
probabilistic parsing, Zipf's Law stats, etc.

Some more application-specific front-end services (based on the existing,
currently non-distributed apps) include, but are not limited to:

• Speaker Identification Service (a front-end) that will communicate 
with the MARF service to carry out application tasks. 

• Language Identification Service would communicate with MARF/NLP
for a similar purpose.

• Some others (front-ends for Zipf's Law, Probabilistic Parsing, and test
applications).

The clients are so-called "thin" clients with a GUI or a Web form allowing
users to upload the samples for training/classification and set the
desired configuration for each run, either for individual samples or for
a batch.

As was done in the Distributed Stock Broker [Mok06b], the architecture
is general and usable enough to enable one or more services using CORBA,
RMI, Web Services (WS), Jini, JMS, sockets, or whatever else (well,
actually, Jini and JMS were not implemented in either application, but
it is not a problem to add them with little or no "disturbance" of the
rest of the architecture).

2.2 Scope 

In the Distributed MARF, if any pipeline stage process crashes, access to
information about the pending transactions and computation in the module
is not only lost while the process remains unavailable but can also be
lost forever.

Use of a message-logging protocol is one way that a module could recover
information concerning its data after a faulty processor has been
repaired. A WAL message-logging protocol is developed for DMARF. It
serves the disaster recovery of uncommitted transactions and helps avoid
data loss. It also allows for backup replication and point-in-time
recovery if WAL logs are shipped off to a backup storage or a replica
manager, where they can be used to reconstruct the replica state via
gossip or any other replication scheme.

The DMARF is also extended by adding a "warm standby". The "warm
standby" is a MARF module that runs in the background (normally on a
different machine), receiving operations from the primary server to
update its state and hence ready to jump in if the primary server fails.
Thus, when the primary server receives a request from a client that will
change its state, it sends the request to the backup server, performs the
request, receives the response from the backup server, and then sends the
reply back to the client. The main purpose of the "warm standby" is to
minimise the downtime for subsequent transactions while the primary is
in disaster recovery. The primary and backup servers communicate using
either the reliable TCP protocol (over a WAN) or FIFO-ordered UDP on a
LAN. Since this is a secondary feature and the load in this project will
be more than average, we simply might not have time to implement and
debug reliable communication over UDP, so we choose to let TCP do it for
us, as we did in StockBroker Assignment 2. If and only if we have time,
we can try to implement FIFO UDP communication.

• Design and implement the set of required interfaces in RMI, CORBA, 
and WS for the main MARF's pipeline stages to run distributedly, 
including any possible application front-end and client applications. 

• Assuming that processor failures are benign (i.e. crash failures) and 
not Byzantine, analysis of the classical MARF was done to determine 
the information necessary for the proper recovery of a MARF module 
(that is, content of the log) and the design of the "warm standby" 
replication system. 

• Modify MARF implementation so that it logs the required information 
using the WAL message-logging protocol. 

• Design and implement a recovery module which restarts a MARF mod- 
ule using the log so that the restarted module can process subsequent 
requests for the various operations. 

• Design and implement the primary server, which receives requests from
clients, sends the request to the backup server, performs the request,
and sends the response back to the client only after the request has
been completed correctly in both the primary and the backup servers.
When the primary notices that the backup does not respond within
a reasonable time, it assumes that the backup has failed and informs
the MARF monitor so that a new backup server can be created and
initialized.

• Design and implement a monitor module which periodically checks
the module process and restarts it if necessary. This monitor
initializes the primary and backup servers at the beginning, creates
and initializes a backup server when the primary fails (and the
original backup server takes over as the primary), and creates and
initializes a backup server when the original backup server fails.

• Design and implement the backup server which receives requests from 
the primary, performs the request and sends the reply back to the 
primary. If the backup server does not receive any request from the 
primary for a reasonable time, it sends a request to the primary to 
check if the latter is working. If the primary server does not reply in 
a reasonable time, the backup server assumes that the primary has 
failed and takes over by configuring itself as the primary so that it can 
receive and handle all client requests from that point onwards; it
also informs the MARF monitor of the switchover so that the latter
can create and initialize another backup server.

• Integrate all the modules properly, deploy the application on a local 
area network, and test the correct operation of the application using 
properly designed test runs. One may simulate a process failure by 
killing that process from the command line while the application is 
running. 
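
The primary-to-backup request flow listed in this section can be
sketched as below. This is a single-process illustration under stated
assumptions: all names are hypothetical, a thrown exception stands in
for "no response within a reasonable time", and a boolean flag stands in
for notifying the monitor. No real network transport is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replica contract; real servers would sit behind
// CORBA, RMI, or WS front-ends.
interface ReplicaSketch {
    String handle(String request) throws Exception;
}

class BackupSketch implements ReplicaSketch {
    final List<String> state = new ArrayList<>();
    boolean alive = true;

    public String handle(String request) throws Exception {
        if (!alive) throw new Exception("backup unreachable");
        state.add(request); // apply the same operation as the primary
        return "ok";
    }
}

class PrimarySketch {
    final List<String> state = new ArrayList<>();
    private final ReplicaSketch backup;
    boolean backupFailed; // would be reported to the monitor

    PrimarySketch(ReplicaSketch backup) {
        this.backup = backup;
    }

    // Forward to the backup, perform locally, then reply to the client.
    String handle(String request) {
        if (!backupFailed) {
            try {
                backup.handle(request);
            } catch (Exception e) {
                backupFailed = true; // stand-in for a response timeout
            }
        }
        state.add(request);
        return "done:" + request;
    }
}
```

The primary keeps serving clients after the backup is marked as failed;
creating and initializing a replacement backup is the monitor's job and
is left out of the sketch.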

2.3 Definitions and Acronyms 

API Application Programming Interface - a common convenience collection
of objects, methods, and other object members, typically in a library,
available for an application programmer to use.

CORBA Common Object Request Broker Architecture - a
language-independent platform for distributed execution of applications
possibly written in different languages; it is, therefore, a
heterogeneous type of RPC (unlike Java RMI, which is Java-specific).

HTML HyperText Markup Language - a tag-based language for defining
the layout of web pages.

IDL Interface Definition Language - a CORBA interface language to "glue"
most common types and data structures in a way independent of any
specific programming language. Interfaces written in IDL are compiled to
language-specific definitions using a defined mapping between constructs
in IDL and the target language, e.g. the IDL-to-Java compiler (idlj) is
used for this purpose in this assignment.

CVS Concurrent Versions System - a version and revision control system 
to manage source code repository. 

DSB Distributed Stock Broker application. 

DMARF Distributed MARF. 

J2SE Java 2 Standard Edition. 

J2EE Java 2 Enterprise Edition.

JAX-RPC Java XML-based RPC way of implementing Web Services.

JAX-WS The new and re-engineered way of Java Web Services implementation,
as opposed to the older JAX-RPC, which is being phased out.

JDK The Java Development Kit. Provides the JRE and a set of tools
(e.g. the javac, idlj, rmic compilers, javadoc, etc.) to develop and
execute Java applications.

JRE The Java Runtime Environment. Provides the JVM and required
libraries to execute Java applications.

JVM The Java Virtual Machine. A program and framework allowing the
execution of programs developed using the Java programming language.

MARF Modular Audio Recognition Framework [Gro06] has a variety of
useful general-purpose utility and storage modules employed in this
work, from the same author.






RMI Remote Method Invocation - an object-oriented way of calling meth- 
ods of objects possibly stored remotely with respect to the calling 
program. 

RPC Remote Procedure Call - a concept, popularized early on by Sun, in
which the implementation of a certain procedure called by a client may
in fact be located remotely from the client, on another machine.

SOAP Simple Object Access Protocol - a protocol for XML message ex- 
change over HTTP often used for Web Services. 

STDOUT Standard output - an output data stream typically associated 
with a screen. 

STDERR Standard error - an output data stream typically associated 
with a screen to output error information as opposed to the rest of the 
output sent to STDOUT. 

WS Web Services - another way of doing RPC among even more hetero- 
geneous architectures and languages using only XML and HTTP as a 
basis. 

WSDL Web Services Description Language, written in XML notation, is a
language to describe the types and message types a service provides and
the data exchanged in SOAP. WSDL's purpose is similar to that of IDL,
and it can be used to generate endpoint interfaces in different
programming languages.



Chapter 3 

System Overview 



Revision : 1.2 

In this chapter, we examine the system architecture of the
implementation of the DMARF application and software interface design
issues.

3.1 Architectural Strategies 

The main principles are: 

Platform-Independence where one targets systems that are capable of 
running a JVM. 

Database-Independent API will allow swapping database/storage engines
on the fly. The appropriate adapters will be designed to feed upon the
required/available data source (binary, CSV file, XML, or SQL
databases).

Communication Technology Independence where the system design
evolves such that any communication technology adapters or plug-ins
(e.g. RMI, CORBA, DCOM+, Jini, JMS, Web Services) can be added with
little or no change to the main logic and code base.

Reasonable Efficiency where one architects and implements an efficient
system, but avoids advanced programming tricks that improve the
efficiency at the cost of maintainability and readability.

Simplicity and Maintainability where one targets a simple and easy
to maintain organization of the source.

Architectural Consistency where one consistently implements the chosen
architectural approach.

Separation of Concerns where one isolates separate concerns between
modules and within modules to encourage re-use and code simplicity.

3.2 System Architecture 

3.2.1 Module View 
Layering 

The DMARF application is divided into layers. The top level has a
front-end and a back-end. The front-end itself exists on the client side
and on the server side. The client side consists of either
text-interactive or non-interactive client classes that connect to and
query the servers. The front-end on the server side comprises the MARF
pipeline itself, the application-specific front-end, and the pipeline
stage services. All pipeline stages are in some way involved with the
database and other storage management subfunctions. At the same time,
the services are a back-end for the clients connecting in.

3.2.2 Execution View 
Runtime Entities 

In the case of the DMARF application, the hosting run-time environment
is the JVM, and on the server side the naming and implementation
repository services must be running, in the form of orbd and
rmiregistry. For the WS aspect of the application, there ought to be DNS
running and a web servlet container. DMARF uses Tomcat [Fou05] as a
servlet container for MARF WS. The client side for RMI and CORBA clients
just requires a JRE (1.4 is the minimum). The WS client, in addition to
the JRE, may require a servlet container environment (here Tomcat) and a
browser to view and submit a web form. Both RMI and CORBA client and
server applications are stand-alone and non-interactive. A GUI is
projected for the client (and possibly the server, to administer it) in
one of the follow-up versions.

Communication Paths 

It was resolved that the modules would all communicate through message
passing between methods. CORBA is one of the networking technologies
used for remote invocation. RMI is the baseline technology used for
remote method calls. Further, JAX-RPC over SOAP is used for Web Services
(while a more modern alternative to JAX-RPC, JAX-WS, has been released,
this project still relies on JAX-RPC 1.1, as it is not using J2EE and
the author found it simpler and faster to use given the timeframe and
the more accurate tutorial and book material available). All three (RMI,
CORBA, and WS) influenced some technology-specific design decisions, but
it was possible to abstract them as RMI and CORBA "agents" and delegate
the business logic to delegate classes, enabling all three types of
services to communicate in the future and to implement transactions
similarly. Communication with the database depends on the storage
manager (each terminal business-logic module in the classical MARF is a
StorageManager). Additionally, Java's reflection [Gre05] is used to
discover instantiation communication paths at run-time for pluggable
modules.
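
The reflection-based loading of pluggable modules mentioned above can be
illustrated with a minimal sketch. The interface and class names here
are hypothetical stand-ins, not the DMARF classes; only the standard
Class.forName and Constructor.newInstance calls are real API.

```java
// Hypothetical plug-in contract for technology-specific utilities.
interface CommUtilsSketch {
    String technology();
}

class RmiUtilsSketch implements CommUtilsSketch {
    public String technology() { return "RMI"; }
}

class CorbaUtilsSketch implements CommUtilsSketch {
    public String technology() { return "CORBA"; }
}

class PluginLoaderSketch {
    // Instantiate a module by class name at run-time.
    // Returns null if the class is missing or of the wrong type.
    static CommUtilsSketch load(String className) {
        try {
            Class<?> c = Class.forName(className);
            return (CommUtilsSketch) c.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            return null;
        }
    }
}
```

With this arrangement the choice of communication technology can come
from configuration (a class name string) rather than from compiled-in
references, which is what keeps the main logic unchanged when a new
adapter is added.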



Execution Configuration 



The execution configuration of the DMARF has to do with where its data/
and policies/ directories are. The data/ directory is always local to
where the application was run from. In the case of WS, it has to be where
Tomcat's current directory is; often this is the logs/ directory of
${catalina.base}.



The data directory contains the service-assigned databases in the
XXX.gzbin files (generated on the first run of the servers). The "XXX"
corresponds to either the training set or the name of a module that
saved its state. Next, orbd keeps its data structures and logs in the
orb.db/ directory, also found in the current directory. Additionally,
the RMI configuration for the application's (both client and server)
policy files is located in allallow.policy (testing with all permissions
enabled). As for the WS, two directories, META-INF/ and WEB-INF/, are
used for deployment. The former contains the Tomcat context file for
deployment that ought to be placed in
${catalina.base}/conf/Catalina/localhost/ and the latter typically goes
to local/marf as the context describes. It contains web.xml and other
XML files produced to describe the servlet-to-SOAP mapping when
generating .war files with wscompile and wsdeploy.

The build-and-run files include the Ant [Con05] dmarf-build.xml and the
GNU make [SMSP00] Makefile files. The Makefile is the one capable of
starting the orbd, rmiregistry, the servers, and the clients in various
modes. The execution configuration primarily targets the Linux FC4
platform (if one intends to use gmake), but is not restricted to it.



A hosts configuration file, dmarf-hosts.properties, is used to tell the
services how to initialize and where to find other services initially.
If the file is not present, the default host for all services is assumed
to be localhost.
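
The localhost fallback described above can be sketched with the standard
java.util.Properties class. The class name, method names, and the
property key scheme below are assumptions for illustration; the actual
keys used in dmarf-hosts.properties are not shown in this document.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

class HostsConfigSketch {
    private final Properties props = new Properties();

    // Load the file if it exists; otherwise keep an empty configuration.
    HostsConfigSketch(String path) {
        try (InputStream in = new FileInputStream(path)) {
            props.load(in);
        } catch (IOException e) {
            // File absent or unreadable: lookups fall back to localhost.
        }
    }

    // The key naming scheme is hypothetical.
    String hostFor(String serviceKey) {
        return props.getProperty(serviceKey, "localhost");
    }
}
```

A service would then call something like
new HostsConfigSketch("dmarf-hosts.properties").hostFor("preprocessing")
at startup and register or look up its peers at the returned host.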



3.3 Coding Standards and Project Management 

In order to produce higher-quality code, it was decided to standardize
on the Hungarian Notation coding style used in MARF [Mok06a].
Additionally, javadoc is used as the source code documentation style for
its completeness and automated tool support. CVS (cvs) [BddzzP+05] was
employed in order to manage the source code, makefile, and documentation
revisions.



3.4 Proof-of-Concept Prototype Assumptions

Since this is a prototype application developed within the timeframe of a
course, some simplifying assumptions were made that were not a part,
explicit or implied, of the specification.

1. There is no garbage collection done on the server side in terms of
fully limiting the WAL size.

2. WAL functionality has not been implemented at all for modules other
than Classification.

3. MARF services do not implement nested transactions while pipelining.

4. Services don't intercommunicate (TCP or UDP) other than through
the pipeline mode of operation.

5. No primary-backup or other replication is present.



3.5 Software Interface Design 

Software interface design comprises both user interfaces and communication 
interfaces (central topic of this work) between modules. 



3.5.1 User Interface 

For the RMI and CORBA clients and servers, a GUI was designed for status
and control, but time did not permit integrating it properly. Therefore,
they use a command-line interface that is typically invoked from the
provided Makefile. GUI integration is projected for the near future. See
the interface prototypes in Figure 3.1 and in Figure 3.2.



3.5.2 Software Interface 

Primary communication-related software interfaces are briefly described
below. A few other interfaces (of storage and classical MARF) are omitted
for brevity.

RMI

The main RMI interfaces the RMI servants implement are ISpeakerIdentRMI,
IMARFServerRMI, ISampleLoaderRMI, IPreprocessingRMI,
IFeatureExtractionRMI, and IClassificationRMI. They are located in the
marf.server.rmi.* and marf.client.rmi.* packages. There are also the
files generated from these interfaces with rmic for the stubs, and the
servant implementations.



CORBA 



The main CORBA IDL interfaces the servants implement are
ISpeakerIdentCORBA, IMARFServerCORBA, ISampleLoaderCORBA,
IPreprocessingCORBA, IFeatureExtractionCORBA, and IClassificationCORBA.
The IDL files are located in the marf.server.corba.* package and are
called MARF.idl and Frontends.idl. There are also the files generated
from this interface definition with idlj for the stubs, skeletons, and
data type holders and helpers, as well as the servant implementations and
a data type adapter (described later).

WS 

The main WS interfaces the WS "servants" (servlets) implement are
ISpeakerIdentWS, IMARFServerWS, ISampleLoaderWS, IPreprocessingWS,
IFeatureExtractionWS, and IClassificationWS. They are located in the
marf.server.ws.* package. There are also the files generated from these
interfaces with wscompile and wsdeploy for the stub and skeleton
serializers and builders for each method and non-primitive Java data
type, and the "servant" implementations. About 8 files are generated for
SOAP XML messages per method or data type, for requests, responses,
faults, building, and serialization.

Delegate 

DMARF is flexible here and allows any delegate implementation as long as
IDelegate in marf.net.server.delegates is implemented. A common
implementation of it is also provided, with the added benefit that all three
types of servants above can use the same delegate implementations and can
therefore share all of the functionality, transactions, and communication.
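
As a minimal sketch of this arrangement (written in present-day Java, and not
the project's actual code), the following shows two technology-specific
servants forwarding to one shared delegate. The names IDelegateSketch,
ClassificationDelegateSketch, RmiStyleServantSketch, and
CorbaStyleServantSketch, as well as the classify() method and its decision
rule, are hypothetical; the real IDelegate contract is not reproduced here.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for IDelegate; the real interface's methods are not listed here.
interface IDelegateSketch {
    String classify(double[] features);
}

// The single place that holds the business logic and its synchronization.
class ClassificationDelegateSketch implements IDelegateSketch {
    private final AtomicInteger calls = new AtomicInteger();

    @Override
    public synchronized String classify(double[] features) {
        calls.incrementAndGet();
        double sum = 0.0;
        for (double f : features) {
            sum += f;
        }
        // Placeholder decision rule, purely illustrative.
        return sum >= 0.0 ? "speaker-A" : "speaker-B";
    }

    int callCount() {
        return calls.get();
    }
}

// Technology-specific servants only receive the call and forward it.
class RmiStyleServantSketch {
    private final IDelegateSketch delegate;

    RmiStyleServantSketch(IDelegateSketch delegate) {
        this.delegate = delegate;
    }

    String classify(double[] features) {
        return delegate.classify(features);
    }
}

class CorbaStyleServantSketch {
    private final IDelegateSketch delegate;

    CorbaStyleServantSketch(IDelegateSketch delegate) {
        this.delegate = delegate;
    }

    String classify(double[] features) {
        return delegate.classify(features);
    }
}

class DelegateDemo {
    public static void main(String[] args) {
        ClassificationDelegateSketch shared = new ClassificationDelegateSketch();
        RmiStyleServantSketch rmi = new RmiStyleServantSketch(shared);
        CorbaStyleServantSketch corba = new CorbaStyleServantSketch(shared);
        System.out.println(rmi.classify(new double[] {1.0, 2.0}));
        System.out.println(corba.classify(new double[] {-3.0, 1.0}));
        System.out.println("calls served by one delegate: " + shared.callCount());
    }
}
```

Because both servants hold a reference to the same delegate instance, a fix
or a locking change made in the delegate applies to every communication
technology at once.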

3.5.3 Hardware Interface

The hardware interface is fully abstracted by the JVM and the underlying
operating system, making the DMARF application fully
architecture-independent. The references to STDOUT and STDERR (by default
the screen or a file) are handled through the System.out and System.err
streams. Likewise, STDIN (by default associated with the keyboard) is
abstracted by Java's System.in.

Figure 3.1: Speaker Ident App Client GUI Prototype

Figure 3.2: MARF Service Status Monitor GUI Prototype



Chapter 4 

Detailed System Design 



Revision : 1.2 

This chapter briefly presents the design considerations and assumptions 
in the form of directory structure, class diagrams as well as storage organi- 
zation. 



4.1 Directory and Package Organization 

In this section, the directory structure is introduced. Please note that
Java, by default, maps sub-packages to subdirectories, which is what we see
in Figure 4.1. Please also refer to Table 4.1 and Table 4.2 for a
description of the data contained in the directories and of the package
organization, respectively.

Directory    Description
---------    -----------
bin/         compiled class files are kept here. The sub-directory
             structure mimics the one of src/.
data/        contains the database as well as the stocks file.
logs/        contains the client and server log files as "screenshots".
orb.db/      contains the naming database as well as logs for the orbd.
doc/         project's API and manual documentation (and theory as well).
lib/         meant for libraries, but for now there is none.
src/         contains the source code files and follows the described
             package hierarchy.
dist/        contains the distro services' .jar and .war files.
policies/    access policies for the RMI client and server granting
             various permissions.
META-INF/    Tomcat's context file (and later manifest) for the
             deployment .war.
WEB-INF/     WS WSDL servlet-related deployment information and classes.



Table 4.1: Details on Main Directory Structure

Package                                Description
-------                                -----------
marf                                   root directory of the MARF project; below are
                                       the packages mostly pertinent to the DMARF
marf.net.*                             MARF's directory for generic networking code
marf.net.client                        client application code and subpackages
marf.net.client.corba.*                Distributed MARF CORBA clients
marf.net.client.rmi.*                  Distributed MARF RMI clients
marf.net.client.ws.*                   Distributed MARF WS clients
marf.net.messenging.*                  reserved for message-passing protocols
marf.net.protocol.*                    reserved for other protocols, like two-phase
                                       commit
marf.net.server.*                      main server code and interfaces are placed here
marf.net.server.rmi.*                  RMI-specific services implementation
marf.net.server.corba.*                CORBA-specific services implementation
marf.net.server.ws.*                   WS-specific services implementation
marf.net.server.delegates.*            service delegate implementations are here
marf.net.server.frontend.*             root of the service front-ends
marf.net.server.frontend.rmi.*         RMI-specific service front-ends
marf.net.server.frontend.corba.*       CORBA-specific service front-ends
marf.net.server.frontend.ws.*          WS-specific service front-ends
marf.net.server.frontend.delegates.*   service front-ends delegate implementations
marf.net.server.gossip                 reserved for the gossip replication
                                       implementation
marf.net.server.gui                    server status GUI
marf.net.server.monitoring             reserved for various service monitors and
                                       their bootstrap
marf.net.server.persistence            reserved for WAL and Transaction storage
                                       management
marf.net.server.recovery               reserved for WAL recovery and logging
marf.Storage                           MARF's storage-related utility classes
marf.util                              MARF's general utility classes (threads,
                                       loggers, array processing, etc.)
marf.gui                               general-purpose GUI utilities to be used in the
                                       MARF apps, clients, and server status monitors



Table 4.2: DMARF's Package Organization

Figure 4.1: Package Structure of the Project

4.2 Class Diagrams

At this stage, the entire design is summarized in five class diagrams
representing the major modules and their relationships. The diagrams of the
overall architecture and its storage subsystem are in Figure 4.3 and
Figure 4.4, respectively. Then, some details on the CORBA, RMI, and WS
implementations are in Figure ??, Figure ??, and Figure ??, respectively.
The detailed description of the modules can be found in the API HTML
generated from the javadoc comments, in the doc/api directory, or in the
javadoc comments themselves. Some of the description appears here as well,
in the form of interaction between classes.

At the top of the hierarchy are the IClient and IServer interfaces, which
are independent of any communication technology and merely "mark" the
would-be classes of either type. This is the design of a system where one
will be able to pick and choose, either manually or automatically, which
communication technologies to use. These interfaces are defined in the
marf.net package and are used in the reflection-based instantiation
utilities.
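
The marker-interface idea combined with reflective instantiation can be
sketched as follows. This is an illustration only, written in present-day
Java: IServerMarker, IClientMarker, the three empty classes, and
ServerFactorySketch are hypothetical names, and the project's own utilities
may work differently.

```java
// Hypothetical marker interfaces modelled on IServer and IClient; they declare no methods.
interface IServerMarker {}

interface IClientMarker {}

class RmiServerSketch implements IServerMarker {}

class CorbaServerSketch implements IServerMarker {}

class WsClientSketch implements IClientMarker {}

class ServerFactorySketch {
    // Instantiates a class by name, but only if it carries the server marker.
    static IServerMarker create(String className) {
        try {
            Class<?> cls = Class.forName(className);
            if (!IServerMarker.class.isAssignableFrom(cls)) {
                throw new IllegalArgumentException(className + " is not marked as a server");
            }
            return (IServerMarker) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // The class name could equally come from a configuration file or the command line.
        IServerMarker server = create(CorbaServerSketch.class.getName());
        System.out.println(server.getClass().getSimpleName());
    }
}
```

The marker check is what lets the choice of technology be deferred to run
time: any class name can be supplied, and only classes marked as servers are
accepted.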

Next, the hierarchy branches into the CORBA, RMI, and WS marker interfaces
for servers and clients: ICORBAServer and ICORBAClient, IRMIServer and
IRMIClient, and IWSServer and IWSClient. The specificity of IRMIServer is
that it extends the Remote interface, as required by the RMI specification.
ICORBAServer allows setting and getting the root POA. IWSServer allows
setting and getting an in-house RemoteObjectReference (which is not a true
object reference as in RMI or CORBA, but encapsulates the necessary service
location information).

The diagram shows only the CORBA details; RMI and WS are similar, but were
omitted because the diagram is already cluttered. The diagram then shows all
six servants and their relationships with the interfaces, as well as the
blended-in WAL logging and transaction recovery. Some monitoring modules are
designed as well.



The clients for the respective technologies are in the
marf.net.client.corba, marf.net.client.rmi, and marf.net.client.ws packages.






[Sequence diagram. Participants: SpeakerIdentApp, MARF Recognition Pipeline,
SampleLoader, Sample, Preprocessing, FeatureExtraction, Classification, and
Result. Calls shown include startRecognitionPipeline(), loadSample(file),
getSampleArray(), ConcretePreprocessing(Sample), preprocess(), normalize(),
FeatureExtraction(Preprocessing), getSample(),
Classification(FeatureExtraction), classify(), and getResult(), with the
notes "generate preprocessed sample" and "generate feature vector".]

Figure 4.2: Sequence Diagram of the Pipeline Of Invocations






[Class diagram. Classes and interfaces shown: SpeakerIdentServant,
SpeakerIdentCORBAClient, IRMIClient, ICORBAClient, MARFObjectAdapter and
ORBUtils (from util); ClassificationServant, FeatureExtractionServant,
PreprocessingServant, ClassificationCORBAPOA, IClassificationDelegate, and
RecoverableClassificationDelegate (abortTransaction(), beginTransaction(),
commitTransaction(), endTransaction(), preliminaryCompleteTransaction(),
prepareTransaction(), requestTransaction()); WriteAheadLogger and
TransactionCoordinator; WAL (DEFAULT_MAX_LOG_SIZE = 1000; addTransaction(),
removeTransaction(), truncateLog(), getTxnCheckpoint(), setTxnCheckpoint(),
setLogData()), WALStorageManager, and Transaction (TXN_IDLE = 0,
TXN_PENDING = 1, TXN_COMMITTED = 2, TXN_ABORTED = 3; status, file name,
action, begin and end timestamps, transaction ID); WALRecovery (from
recovery); StaleMateMonitor, ServiceMonitor, and Bootstrap (from
monitoring).]



Figure 4.3: General Architecture Class Diagram of marf.net

When implementing the CORBA services, a data type adapter had to be made to
adapt certain data structures that came from MARF.idl to the common storage
data structures (e.g. Sample, Result, CommunicationException, ResultSet,
etc.). Thus, the MARFObjectAdapter class was provided to adapt these data
structures back and forth with the generic delegate when needed.
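
A minimal sketch of such a two-way adapter is given below, under stated
assumptions. CorbaResultStructSketch imitates the shape idlj produces for an
IDL struct (a final class with public fields), CommonResultSketch stands for
a technology-neutral storage class, and ObjectAdapterSketch is a
hypothetical name; the field names are invented for illustration and are not
taken from MARF.idl.

```java
import java.io.Serializable;

// Shaped like what idlj emits for an IDL struct: a final class with public fields.
// The struct name and its fields are hypothetical.
final class CorbaResultStructSketch {
    public int id;
    public double outcome;

    public CorbaResultStructSketch() {}

    public CorbaResultStructSketch(int id, double outcome) {
        this.id = id;
        this.outcome = outcome;
    }
}

// Technology-neutral counterpart used by the delegate and the storage layer.
class CommonResultSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    private final int id;
    private final double outcome;

    CommonResultSketch(int id, double outcome) {
        this.id = id;
        this.outcome = outcome;
    }

    int getId() {
        return id;
    }

    double getOutcome() {
        return outcome;
    }
}

// Converts in both directions so that the servant can talk to the generic delegate.
class ObjectAdapterSketch {
    static CommonResultSketch toCommon(CorbaResultStructSketch struct) {
        return new CommonResultSketch(struct.id, struct.outcome);
    }

    static CorbaResultStructSketch toCorba(CommonResultSketch result) {
        return new CorbaResultStructSketch(result.getId(), result.getOutcome());
    }

    public static void main(String[] args) {
        CorbaResultStructSketch fromWire = new CorbaResultStructSketch(7, 0.25);
        CommonResultSketch common = toCommon(fromWire);
        CorbaResultStructSketch backToWire = toCorba(common);
        System.out.println(backToWire.id + " " + backToWire.outcome);
    }
}
```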



The servers for the respective technologies are in the
marf.net.server.corba, marf.net.server.rmi, and marf.net.server.ws packages.

Finally, on the server side, the RecoverableClassificationDelegate
interacts with the WriteAheadLogger for transaction information. The
storage manager here serializes the WAL entries.

More design details are revealed in the class diagram of the storage-related
aspects in Figure 4.4. The Database contains classification statistics and
is only written by the SpeakerIdent front-end. Database, Sample, Result,
ResultSet, and TrainingSet all implement Serializable so that they can be
stored on disk or transferred over a network.

The serialization of the WAL instance into a file is handled by the
WALStorageManager class. The IStorageManager interface and its most generic
implementation, StorageManager, also come from MARF's marf.Storage package.
The StorageManager class provides the implementation of serialization of
classes in plain binary as well as compressed binary formats. (It also has
facilities to plug in other storage or output formats, such as CSV, XML,
HTML, and SQL, which derivatives must implement if they wish.)
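
The difference between the plain and compressed binary formats can be
illustrated with standard Java serialization. This is a sketch only:
TrainingSetSketch and StorageSketch are hypothetical names, GZIP is an
assumed choice of compression, and byte arrays are used in place of files to
keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical serializable object standing in for a training set.
class TrainingSetSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    final String speaker;
    final double[] means;

    TrainingSetSketch(String speaker, double[] means) {
        this.speaker = speaker;
        this.means = means;
    }
}

class StorageSketch {
    // Serializes the object, optionally passing the bytes through GZIP.
    static byte[] dump(Serializable object, boolean compressed) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            OutputStream sink = compressed ? new GZIPOutputStream(bytes) : bytes;
            try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
                out.writeObject(object);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reverses dump(); the caller must say whether the data was compressed.
    static Object restore(byte[] data, boolean compressed) {
        try {
            InputStream source = new ByteArrayInputStream(data);
            if (compressed) {
                source = new GZIPInputStream(source);
            }
            try (ObjectInputStream in = new ObjectInputStream(source)) {
                return in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TrainingSetSketch set = new TrainingSetSketch("speaker-1", new double[1000]);
        byte[] plain = dump(set, false);
        byte[] packed = dump(set, true);
        System.out.println("plain bytes: " + plain.length + ", compressed bytes: " + packed.length);
        TrainingSetSketch copy = (TrainingSetSketch) restore(packed, true);
        System.out.println(copy.speaker + " " + copy.means.length);
    }
}
```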



4.3 Data Storage Format 

This section is about data storage issues, the details of the chosen
underlying implementation, and ways of addressing those issues. For the
details on the classical MARF storage subsystem, please refer to the Storage
chapter in [Gro06].

Figure 4.4: Storage Class Diagram

4.3.1 Log File Format



The log is saved in the module-technology.log files for the server and
client respectively, in the application's current directory. As of this
version, the file is produced with the help of the Logger class in
marf.util. (Another logging facility, which was considered but is so far
used only in WS with Tomcat, is the Log4J tool [AGS+06], which has a
full-fledged logging engine.) The log file produced by Logger has the
classical format of "[ time stamp ]: message". The logger intercepts all
attempts to write to STDOUT or STDERR and copies them to the file. The
output to STDOUT and STDERR is also preserved. If the file remains between
different runs, the log data is appended.
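
The interception described above can be approximated with a "tee" stream.
The sketch below is not the marf.util.Logger implementation: the class name
TeeLoggerSketch, the install() helper, and the timestamp pattern are
assumptions, and only println(String) is intercepted to keep the example
short.

```java
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.text.SimpleDateFormat;
import java.util.Date;

class TeeLoggerSketch extends PrintStream {
    private final PrintStream original;
    private final SimpleDateFormat stamp = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    TeeLoggerSketch(OutputStream logSink, PrintStream original) {
        super(logSink, true); // auto-flush the log on every println
        this.original = original;
    }

    // Only println(String) is intercepted here; a complete logger would cover
    // the other print and write methods as well.
    @Override
    public synchronized void println(String message) {
        original.println(message); // the console output is preserved
        super.println("[" + stamp.format(new Date()) + "]: " + message); // stamped copy
    }

    // Routes STDOUT and STDERR through the tee; 'true' opens the file in append
    // mode, so data from earlier runs is kept.
    static void install(String logFileName) throws FileNotFoundException {
        OutputStream file = new FileOutputStream(logFileName, true);
        System.setOut(new TeeLoggerSketch(file, System.out));
        System.setErr(new TeeLoggerSketch(file, System.err));
    }
}
```

After a call such as TeeLoggerSketch.install("module-technology.log"),
ordinary System.out.println() calls would reach both the console and the
file.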



4.4 Synchronization 

The notion of synchronization is crucial in an application that allows
access to a shared resource or data structure by multiple clients. This
includes our DMARF. On the server side, synchronization must be maintained
when the Database or TrainingSet objects are accessed through the server,
possibly by multiple clients. In this version, the Database class becomes
its own object monitor and all its relevant methods are made synchronized,
thus locking the entire object while it is accessed by a thread and thereby
providing data integrity. The whole-instance locking may be a bit
inefficient, but it can be carefully redone by marking only some critical
paths rather than the entire object.
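
Both locking styles can be shown side by side. In this sketch,
DatabaseSketch is a hypothetical class holding a hit counter per speaker; it
is not the project's Database class. The synchronized methods lock the whole
instance, while recordRawHit() confines the lock to the one statement that
touches shared state.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class DatabaseSketch {
    private final Map<String, Integer> hits = new HashMap<>();

    // Whole-instance locking: every relevant method is synchronized on 'this'.
    synchronized void recordHit(String speaker) {
        hits.merge(speaker, 1, Integer::sum);
    }

    synchronized int hitsFor(String speaker) {
        return hits.getOrDefault(speaker, 0);
    }

    // Narrower variant: only the shared-state update is a critical section.
    void recordRawHit(String rawName) {
        String speaker = rawName.trim().toLowerCase(Locale.ROOT); // touches no shared state
        synchronized (this) {
            hits.merge(speaker, 1, Integer::sum);
        }
    }
}

class SyncDemo {
    // Runs 'threads' workers that each record 'perThread' hits; returns the final count.
    static int hammer(int threads, int perThread) {
        DatabaseSketch db = new DatabaseSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    db.recordRawHit("  Speaker-1 ");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return db.hitsFor("speaker-1");
    }

    public static void main(String[] args) {
        System.out.println("hits recorded: " + hammer(4, 1000));
    }
}
```

Since both styles lock on the same monitor, they can be mixed safely within
one class; the narrower form simply keeps work that needs no protection
outside the lock.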

Furthermore, multiple servers keep copies of their own data structures,
including stock data, which allows for more concurrency. On top of that,
the WS, RMI, and CORBA brokers act through a delegate implementation, which
keeps all the synchronization and business logic in one place and decouples
communication from the logic. The rest is taken care of by the WAL.

4.5 Write-Ahead Logging and Recovery

The recovery log design is based on the principle of the write-ahead log- 
ging. This means the transaction data is written first to the log, and upon 
successful return from writing the log, the transaction is committed. 

Checkpointing is done periodically by flushing all the transactions to
disk, with the ID of the latest committed transaction recorded as the
checkpoint. In the event of a crash, upon restart, the WAL is read and the
object states are recovered from the latest checkpoint.

The design of the WAL algorithm in DMARF is modified such that the logged
transaction data contains the "before" and "after" snapshots of the object
in question (a training set, message, or the whole module itself). In part
this is because the transactions are wrapped around the classical business
logic, which does alter the objects on disk; so, in the event of a failure,
the "before" snapshot is used to revert the object state on disk to what it
was before the transaction in question began.
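
The role of the "before" snapshot can be sketched as follows. This is an
in-memory illustration under stated assumptions, not the project's WAL
class: TxnSketch and WalSketch are hypothetical names, a Map of strings
stands in for objects on disk, and the log itself is not serialized. The
status values mirror the TXN_PENDING, TXN_COMMITTED, and TXN_ABORTED
constants visible in the class diagram.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TxnSketch {
    static final int TXN_PENDING = 1;
    static final int TXN_COMMITTED = 2;
    static final int TXN_ABORTED = 3;

    final long id;
    final String key;
    final String before; // state before the change; null if the object did not exist
    final String after;  // state the transaction intends to leave behind
    int status = TXN_PENDING;

    TxnSketch(long id, String key, String before, String after) {
        this.id = id;
        this.key = key;
        this.before = before;
        this.after = after;
    }
}

class WalSketch {
    private final List<TxnSketch> log = new ArrayList<>();
    private final Map<String, String> disk; // stands in for objects stored on disk
    private long nextId = 1;

    WalSketch(Map<String, String> disk) {
        this.disk = disk;
    }

    // Logs both snapshots first, and only then lets the change reach "disk".
    long write(String key, String after) {
        TxnSketch txn = new TxnSketch(nextId++, key, disk.get(key), after);
        log.add(txn);
        disk.put(key, after);
        return txn.id;
    }

    void commit(long id) {
        for (TxnSketch txn : log) {
            if (txn.id == id) {
                txn.status = TxnSketch.TXN_COMMITTED;
            }
        }
    }

    // After a failure: revert every uncommitted transaction, newest first.
    void recover() {
        for (int i = log.size() - 1; i >= 0; i--) {
            TxnSketch txn = log.get(i);
            if (txn.status == TxnSketch.TXN_PENDING) {
                if (txn.before == null) {
                    disk.remove(txn.key);
                } else {
                    disk.put(txn.key, txn.before);
                }
                txn.status = TxnSketch.TXN_ABORTED;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> disk = new HashMap<>();
        disk.put("training-set", "v1");
        WalSketch wal = new WalSketch(disk);
        wal.commit(wal.write("training-set", "v2"));
        wal.write("training-set", "v3"); // never committed, as if the server crashed here
        wal.recover();
        System.out.println(disk.get("training-set"));
    }
}
```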

The WAL grows up to a certain number of committed transactions. Periodic
garbage collection on the WAL and checkpointing are performed. During
garbage collection, the oldest aborted transactions are removed, as well as
up to 1000 committed transactions. The WAL can be periodically backed up or
shipped to another server for replication or point-in-time recovery (PITR);
there are timestamps associated with each serialized transaction.
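
One possible reading of this garbage-collection pass is sketched below. It
is an assumption-laden illustration rather than the project's truncateLog():
LogEntrySketch and WalGcSketch are hypothetical names, all aborted entries
are dropped, the oldest committed entries are dropped up to a cap, pending
entries are kept, and the checkpoint is taken to be the highest committed
transaction ID seen.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class LogEntrySketch {
    final long id;
    final int status;

    LogEntrySketch(long id, int status) {
        this.id = id;
        this.status = status;
    }
}

class WalGcSketch {
    static final int TXN_PENDING = 1;
    static final int TXN_COMMITTED = 2;
    static final int TXN_ABORTED = 3;
    static final int DEFAULT_MAX_LOG_SIZE = 1000; // value shown in the class diagram

    private final List<LogEntrySketch> log = new ArrayList<>(); // oldest entries first
    private long checkpoint = 0;

    void add(long id, int status) {
        log.add(new LogEntrySketch(id, status));
    }

    // Removes aborted entries and up to 'maxCommitted' of the oldest committed ones.
    // Returns the number of entries removed.
    int collect(int maxCommitted) {
        int removedCommitted = 0;
        int removed = 0;
        for (Iterator<LogEntrySketch> it = log.iterator(); it.hasNext();) {
            LogEntrySketch entry = it.next();
            if (entry.status == TXN_ABORTED) {
                it.remove();
                removed++;
            } else if (entry.status == TXN_COMMITTED) {
                checkpoint = Math.max(checkpoint, entry.id);
                if (removedCommitted < maxCommitted) {
                    it.remove();
                    removedCommitted++;
                    removed++;
                }
            }
        }
        return removed;
    }

    int size() {
        return log.size();
    }

    long checkpoint() {
        return checkpoint;
    }

    public static void main(String[] args) {
        WalGcSketch wal = new WalGcSketch();
        wal.add(1, TXN_COMMITTED);
        wal.add(2, TXN_ABORTED);
        wal.add(3, TXN_PENDING);
        int removed = wal.collect(DEFAULT_MAX_LOG_SIZE);
        System.out.println(removed + " removed, " + wal.size() + " left, checkpoint " + wal.checkpoint());
    }
}
```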

For the most part, the WAL is pertinent to the Classification service, as
this is where most of the writes are done during the training phase (in the
classification phase it is only reading). The sample loading,
preprocessing, and feature extraction services can also perform
intermediate writes if asked, but most of the time they crunch the data and
pass it around. The classification statistics are maintained at the
application-specific front-end for now, and the writes there are
serialized.

4.6 Replication

Replication is done in one of two ways. The first is by means of the WAL:
ship the WAL over to another host and "replay" it along a certain timeline.
The second is a lazy update through the gossip architecture among replicas.
Delegates broadcast "whoHas(config)" requests before computing anything
themselves; if no response is received shortly after, the delegate issuing
the request starts to compute the configuration itself; otherwise, a
transfer is initiated from another delegate that has computed an identical
configuration.
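
The "whoHas(config)" exchange can be sketched with a timeout on the first
positive answer. The gossip package is described as reserved, so everything
here is hypothetical: PeerSketch, GossipDelegateSketch, the obtain() method,
and the "transferred:" and "computed:" prefixes exist only to make the two
outcomes visible, and local calls stand in for the network broadcast.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical view of a remote replica: answers with the result if it already has one.
interface PeerSketch {
    Optional<String> whoHas(String config);
}

class GossipDelegateSketch {
    private final List<PeerSketch> peers;
    private final long timeoutMillis;

    GossipDelegateSketch(List<PeerSketch> peers, long timeoutMillis) {
        this.peers = peers;
        this.timeoutMillis = timeoutMillis;
    }

    // Asks all peers first; computes locally only if nobody answers in time.
    String obtain(String config) {
        CompletableFuture<String> firstAnswer = new CompletableFuture<>();
        for (PeerSketch peer : peers) {
            // Each peer is asked concurrently; the first positive answer wins.
            CompletableFuture.runAsync(() -> peer.whoHas(config).ifPresent(firstAnswer::complete));
        }
        try {
            return "transferred:" + firstAnswer.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "computed:" + compute(config);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Placeholder for the expensive local computation.
    private String compute(String config) {
        return "result(" + config + ")";
    }
}
```

The length of the timeout is the main design choice: too short and replicas
duplicate work that a peer has already done, too long and a lone replica
waits needlessly before computing.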



Chapter 5 

Testing 



The testing conducted so far covers the (mostly CORBA) pipeline, including
a single training test and a batch training on at most four computers in
separate buildings. The Makefile and batch.sh serve this purpose. If you
intend to use them, make sure you have the server jars in dist/ and a
properly configured dmarf-hosts.properties.

The tests were quite successful: terminating any of the service replicas and
restarting it resumed normal operation of the pipeline in the batch mode.
More thorough testing is to be conducted as the project evolves from a
proof-of-concept to a cleaner solution.



Chapter 6

Conclusion 



Revision : 1.2 

Out of the three main distributed technologies learnt and used through 
the course (RMI, CORBA, and Web Services) to implement the MARF 
services, I managed to implement all three. 

The Java RMI technology seems to be the lowest-level remote method
invocation tool for programmers to use. Things like Jini and JMS tend to be
more programmer-friendly. An additional limitation of RMI is the
requirement that remote methods throw RemoteException, so an
RMI-independent interface hierarchy does not work when generating stubs.

A similar problem exists for CORBA, which generates even CORBA- 
specific data structures from the struct definitions that cannot be easily 
linked to the data structures used elsewhere throughout the program through 
inheritance or interfaces. 

The WS implementation, built from the provided Java endpoint interface and
a couple of XML files, was a natural extension of the RMI implementation,
but with somewhat different semantics. The implementation aspect was not
hard, but the deployment within a servlet container and WSDL compilation
were a large headache.

However, the highly modular design allowed swapping module implementations
from one technology to another if need be, making it very extensible by
means of delegating the actual business logic to the delegate classes. As
an added bonus of that implementation, the RMI, CORBA, and WS services can
communicate through TCP or UDP and perform transactions. Likewise, all the
synchronization efforts are undertaken by the delegate, and the delegate is
the single place to fix if something is broken. Aside from the delegate
class, a data adapter class for CORBA also contributes here by translating
the data structures.

6.1 Summary of Technologies Used 

The following were the most prominent technologies used throughout the 
implementation of the project: 

• J2SE (primarily 1.4)

• Java IDL [Mic04]

• Java RMI [WW05]

• Java WS with JAX-RPC [Mic06]

• Java Servlets [Mic05a]

• Java Networking [Mic05b]

• Eclipse IDE [c+04]

• Apache Ant [Con05]

• Apache Jakarta Tomcat 5.5.12 [Fou05]

• GNU Make [SMSP00]

6.2 Future Work and Work-in-Progress 

Extend the remote framework to include other communication technologies
(Jini, JMS, DCOM+, .NET Remoting) in a communication-independent fashion
and transplant all of that for use in MARF [Gro06]. Additionally, complete
the application GUI for the client and possibly server implementations.
Finally, complete the advanced features of distributed systems, such as
disaster recovery, fault tolerance, high availability, and replication,
with a great deal of thorough testing.